Popularity and average top 10% percentile count over time¶

We can see that the share of reviews going to the top 5%, and especially the top 1%, of podcasts started increasing significantly after 2018. This suggests that the investments paid off at least partially.

Some questions we need to consider: if Spotify's investment in podcasts (both in specific shows and in overall infrastructure) resulted in significant user growth, did most of these users:

  1. Disproportionately listen to these most expensive/most popular podcasts?
  2. If so, did these new users later engage with other podcasts as well?
  3. Were the new users retained over a longer period, or was there a significant drop-off?

Did most of the growth go to the top 1% of podcasts?¶
Hypothesis I¶

Did Spotify's investment and overall strategy of focusing on a small number of creators prove effective? Specifically, did the growth rate in popularity of the most popular podcasts (defined as the top 1% by number of reviews) exceed that of other podcasts? Based on this question, we formulate our first hypothesis:

H1: The number of reviews for the most popular podcasts is increasing at a faster rate than for the bottom 99% of all podcasts.

To test this hypothesis, follow these steps:

  1. Transform the reviews_by_month_count_df_after_2015 dataframe to show the monthly growth rate for the top 1% and bottom 99% of podcasts.
count
0 1981420

Growth rate for top 1%:
Mean: 0.08 (Std Dev: 0.44)

Growth rate for bottom 99%:
Mean: 0.02 (Std Dev: 0.14)
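
As a rough illustration of how growth-rate figures like the ones above could be obtained, the sketch below computes month-over-month growth as (current - previous) / previous. The notebook's actual transformation of reviews_by_month_count_df_after_2015 is not shown, so the function name and the sample counts are hypothetical.

```python
from statistics import mean, stdev


def monthly_growth_rates(monthly_counts):
    """Return period-over-period growth rates: (current - previous) / previous."""
    rates = []
    for previous, current in zip(monthly_counts, monthly_counts[1:]):
        if previous == 0:
            continue  # growth from zero is undefined, so the month is skipped
        rates.append((current - previous) / previous)
    return rates


# Hypothetical monthly review counts for one group (not taken from the dataset)
top_1_counts = [100, 120, 90, 135]
rates = monthly_growth_rates(top_1_counts)
print(f"Mean: {mean(rates):.2f} (Std Dev: {stdev(rates):.2f})")
```

Running the same function on the counts of each group would give two series of growth rates that can then be compared with a statistical test.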

To decide the appropriate test, the following should be considered:

  1. Data should be normally distributed or the sample size should be large.
  2. Variances of the two groups being compared should be equal.

If these assumptions do not hold, a non-parametric test like the Mann-Whitney U Test should be used.

Shapiro-Wilk Test for Normality:
Test Stat: 0.76
P-value: 0.0 (a p-value below 0.05 means we reject the assumption that the data are normally distributed).
Because the normality assumption is rejected, a non-parametric test should be used.

Levene Test for Homogeneity of Variances:
Test Stat: 40.53
P-value: 0.0
This also supports the decision to use a non-parametric test.
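
For readers who want to see what the Levene statistic measures, here is a minimal sketch of its median-centered (Brown-Forsythe) variant in plain Python. Whether the notebook used this variant is an assumption, and the sketch returns only the test statistic, not the p-value (which requires the F distribution). The function name and sample data are hypothetical.

```python
from statistics import mean, median


def levene_statistic(*groups):
    """Median-centered Levene statistic for two or more groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group's median
    deviations = [[abs(x - median(g)) for x in g] for g in groups]
    group_means = [mean(zs) for zs in deviations]
    grand_mean = mean([z for zs in deviations for z in zs])
    # Spread of the group-level mean deviations around the grand mean
    between = sum(len(zs) * (m - grand_mean) ** 2
                  for zs, m in zip(deviations, group_means))
    # Spread of the individual deviations around their group mean
    within = sum((z - m) ** 2
                 for zs, m in zip(deviations, group_means) for z in zs)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical samples: same location, very different spread
narrow = [1, 2, 3, 4, 5]
wide = [-20, -10, 0, 10, 20]
print(levene_statistic(narrow, wide))
```

A statistic near zero means the groups have similar spread; larger values point toward unequal variances.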

Mann-Whitney U Test:
U-value: 40.53
P-value: 0.0

The p-value is below 0.05, indicating a statistically significant difference between the growth rates of the top 1% and the bottom 99%. Since the mean growth rate is also higher for the top 1% (0.08 vs. 0.02), this supports H1.
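
To make the test itself less of a black box, the following sketch implements a two-sided Mann-Whitney U test in plain Python, using average ranks for ties and a normal approximation without continuity correction. It is not the notebook's code (which presumably calls a statistics library), and the sample growth rates are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value via normal approximation)."""
    combined = sorted((value, group)
                      for group, sample in enumerate((x, y))
                      for value in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values at positions i..j-1 share the average of ranks i+1..j
        for pos in range(i, j):
            ranks[pos] = (i + 1 + j) / 2
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return u_x, 1.0  # all values identical, so no evidence of a difference
    z = (u_x - mu) / sigma
    return u_x, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical monthly growth rates for the two groups
top_1 = [0.10, 0.25, -0.05, 0.40, 0.15]
bottom_99 = [0.01, 0.03, -0.02, 0.02, 0.04]
u_value, p_value = mann_whitney_u(top_1, bottom_99)
print(f"U-value: {u_value}, P-value: {p_value:.3f}")
```

Note that the test is two-sided: it detects a difference between the groups but not its direction, which is why the group means are compared separately.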

Comparison of Average Growth Rates:
Average growth for top 1%: 0.08
Average growth for the bottom 99%: 0.02

Distribution of Podcasts by Popularity:¶

Gini coefficient is: 0.9302907076746617

We can further see that reviews are distributed extremely unevenly across podcasts. Specifically, the top 1% of all podcasts by review count account for 57% of all reviews.
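
Both figures can be reproduced from a plain list of per-podcast review counts. The sketch below uses the standard rank-based formula for the Gini coefficient and a simple "largest k items" definition of the top share; the exact definitions used in the notebook are not shown, so treat these as assumptions. The function names and sample data are hypothetical.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * v for rank, v in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def top_share(values, fraction=0.01):
    """Share of the total held by the largest `fraction` of items."""
    ordered = sorted(values, reverse=True)
    k = max(1, math.ceil(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical review counts: one very popular podcast and many small ones
review_counts = [500] + [1] * 99
print(f"Gini coefficient is: {gini(review_counts):.3f}")
print(f"Top 1% share: {top_share(review_counts):.1%}")
```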

User Engagement Analysis¶

What are the user listening patterns?

Distribution of review count by user:
mean: 1.34
median: 1.00
stdev: 1.83
skewness*: 98.11
kurtosis**: 21137.37

The skewness is extremely high and indicates a very strong rightward skew: most values are clustered on the left, with a few extremely large values on the right.

* Skewness measures the direction and degree of asymmetry. A positive skew indicates that the tail is on the right side of the distribution.
** High kurtosis means more of the variance is the result of infrequent extreme deviations.
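
As a small illustration of the two shape statistics, the sketch below computes moment-based skewness and excess kurtosis in plain Python. These are the simple population versions; library implementations often apply a bias correction, so the values may differ slightly from the ones reported here. The function names and sample data are hypothetical.

```python
from statistics import fmean


def central_moment(values, order):
    """Average of (value - mean) ** order over all values."""
    m = fmean(values)
    return sum((v - m) ** order for v in values) / len(values)


def skewness(values):
    """Population skewness: third central moment scaled by variance ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Population excess kurtosis: 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical reviews-per-user data: most users write one review, one writes many
reviews_per_user = [1] * 9 + [50]
print(f"skewness: {skewness(reviews_per_user):.2f}")
print(f"kurtosis: {excess_kurtosis(reviews_per_user):.2f}")
```

A single heavy reviewer is enough to push both statistics well above zero, which matches the pattern described above.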

Podcast Genre Analysis¶

category podcast_id rating content_length title_length review_count
0 true-crime bf5bf76d5b6ffbf9a31bba4480383b7f 4.353402 265.356595 12.0 31010
1 true-crime bc5ddad3898e0973eb541577d1df8004 3.686242 305.848232 61.0 9587
2 comedy bc5ddad3898e0973eb541577d1df8004 3.686242 305.848232 61.0 9587
3 news f5fce0325ac6a4bf5e191d6608b95797 3.839367 219.777151 20.0 7265
4 true-crime b1a3eb2aa8e82ecbe9c91ed9a963c362 4.247432 249.893554 19.0 6717
... ... ... ... ... ... ...
212367 leisure-animation-manga f0e247111c8985e0c5e14cc8d6442f09 5.000000 173.000000 22.0 1
212368 leisure d0bc8c5bf6f0f1eeda8d5c1c8b38adc9 5.000000 119.000000 17.0 1
212369 leisure-animation-manga f1bf522813566465708ba99c92813c84 5.000000 481.000000 17.0 1
212370 judaism f564e91cdf68e9c51a40fc38b73da7b6 5.000000 228.000000 24.0 1
212371 history c3f0fe1ab04701f43cc02fa0316d23cf 5.000000 27.000000 17.0 1

212372 rows × 6 columns

Total Unique Podcasts
110024
category review_count unique_podcasts
79 society-culture 329054 13731
16 comedy 306950 11864
103 true-crime 154221 1264
20 education 145413 8827
68 religion-spirituality 141541 12095
104 tv-film 133763 6469
8 business 116883 8072
86 sports 113116 7266
59 news 103378 4297
30 health-fitness 96948 6050
15 christianity 84668 7954
0 arts 84494 6078
41 kids-family 66247 2383
38 history 57816 1663
46 leisure 54142 4178
top_level_category review_count unique_podcasts
19 society 437998 19441
4 comedy 333317 12803
2 business 223394 12931
5 education 217351 13005
8 health 184358 8731
21 sports 184103 9280
16 news 179549 6606
24 tv 168752 8285
23 true-crime 154221 1264
17 religion 143055 12246
0 arts 141137 9675
14 leisure 99622 7011
13 kids 88307 3321
3 christianity 84668 7954
15 music 60633 6161
Index(['category', 'podcast_id', 'rating', 'content_length', 'title_length',
       'review_count', 'top_level_category'],
      dtype='object')
array(['society', 'comedy', 'business', 'education', 'health', 'sports',
       'news', 'tv', 'true-crime', 'religion', 'arts', 'leisure', 'kids',
       'christianity', 'music'], dtype=object)
array(['true-crime', 'comedy', 'news', 'society', 'kids', 'education',
       'religion', 'sports', 'tv', 'health', 'business', 'music', 'arts',
       'christianity', 'leisure'], dtype=object)

Hypothesis II¶

We can see that the proportion of reviews belonging to the top 1% of podcasts varies widely between genres. Based on this, we can check in which categories the proportion of reviews belonging to the top 1% increased the most.

category podcast_id year_month review_count top_level_category
0 business a00018b54eb342567c94dacfb2a3e504 2017-10-31 1 business
1 christianity a00043d34e734b09246d17dc5d56f63c 2019-09-30 1 christianity
2 religion-spirituality a00043d34e734b09246d17dc5d56f63c 2019-09-30 1 religion
3 religion-spirituality a0004b1ef445af9dc84dad1e7821b1e3 2011-08-31 1 religion
4 spirituality a0004b1ef445af9dc84dad1e7821b1e3 2011-08-31 1 spirituality
... ... ... ... ... ...
1247729 news ffff32caeedd6254573ad1cc49852595 2018-02-28 1 news
1247745 arts ffff5db4b5db2d860c49749e5de8a36d 2011-05-31 1 arts
1247759 comedy ffff66f98c1adfc8d0d6c41bb8facfd0 2018-09-30 4 comedy
1247761 education ffff923482740bc21a0fe184865ec2e2 2018-04-30 1 education
1247763 comedy ffffbd44ec5f79d502f16ae372bf2d4f 2021-08-31 1 comedy

151349 rows × 5 columns

Index(['category', 'podcast_id', 'year_month', 'review_count',
       'top_level_category'],
      dtype='object')
top_level_category post_cutoff is_top_1_percent review_count total prop_of_all_reviews
1 arts False True 3574 16557 0.215860
3 arts True True 2285 9112 0.250768
5 buddhism False True 47 184 0.255435
8 business False True 9377 33327 0.281363
10 business True True 4451 19714 0.225779
12 christianity False True 1757 10361 0.169578
14 christianity True True 2201 7094 0.310262
16 comedy False True 8684 30434 0.285339
18 comedy True True 5584 15611 0.357696
20 education False True 6855 24218 0.283054
22 education True True 3860 19103 0.202063
24 fiction False True 841 2451 0.343125
26 fiction True True 1084 3467 0.312662
28 government False True 594 1936 0.306818
30 government True True 127 857 0.148191
32 health False True 4417 17304 0.255259
34 health True True 2611 14157 0.184432
36 hinduism False True 12 34 0.352941
39 history False True 1541 4142 0.372042
41 history True True 393 2545 0.154420
43 islam False True 23 257 0.089494
45 islam True True 33 124 0.266129
47 judaism False True 40 246 0.162602
49 judaism True True 28 258 0.108527
51 kids False True 1808 7357 0.245752
53 kids True True 1572 5565 0.282480
55 leisure False True 3168 10201 0.310558
57 leisure True True 1052 6776 0.155254
59 music False True 2106 8978 0.234573
61 music True True 1551 4620 0.335714
63 news False True 3627 11611 0.312376
65 news True True 3560 10340 0.344294
67 religion False True 3255 17109 0.190251
69 religion True True 2993 10715 0.279328
71 science False True 1011 3372 0.299822
73 science True True 294 2073 0.141823
75 society False True 13055 41444 0.315003
77 society True True 6256 26072 0.239951
79 spirituality False True 1083 4115 0.263183
81 spirituality True True 688 2726 0.252384
83 sports False True 5028 16407 0.306455
85 sports True True 2759 11894 0.231966
87 technology False True 1646 6719 0.244977
89 technology True True 445 1915 0.232376
91 true-crime False True 2325 5044 0.460944
93 true-crime True True 911 5503 0.165546
95 tv False True 4485 16644 0.269466
97 tv True True 2837 8515 0.333177
top_level_category pre_cutoff_ratio post_cutoff_ratio pre_cutoff_review_count post_cutoff_review_count relative_change_in_ratio sum_review_count
0 arts 0.215860 0.250768 3574.0 2285.0 0.161715 5859.0
2 business 0.281363 0.225779 9377.0 4451.0 -0.197555 13828.0
3 christianity 0.169578 0.310262 1757.0 2201.0 0.829611 3958.0
4 comedy 0.285339 0.357696 8684.0 5584.0 0.253585 14268.0
5 education 0.283054 0.202063 6855.0 3860.0 -0.286134 10715.0
6 fiction 0.343125 0.312662 841.0 1084.0 -0.088781 1925.0
7 government 0.306818 0.148191 594.0 127.0 -0.517006 721.0
8 health 0.255259 0.184432 4417.0 2611.0 -0.277472 7028.0
10 history 0.372042 0.154420 1541.0 393.0 -0.584939 1934.0
11 islam 0.089494 0.266129 23.0 33.0 1.973703 56.0
12 judaism 0.162602 0.108527 40.0 28.0 -0.332558 68.0
13 kids 0.245752 0.282480 1808.0 1572.0 0.149449 3380.0
14 leisure 0.310558 0.155254 3168.0 1052.0 -0.500081 4220.0
15 music 0.234573 0.335714 2106.0 1551.0 0.431169 3657.0
16 news 0.312376 0.344294 3627.0 3560.0 0.102177 7187.0
17 religion 0.190251 0.279328 3255.0 2993.0 0.468210 6248.0
18 science 0.299822 0.141823 1011.0 294.0 -0.526975 1305.0
19 society 0.315003 0.239951 13055.0 6256.0 -0.238259 19311.0
20 spirituality 0.263183 0.252384 1083.0 688.0 -0.041032 1771.0
21 sports 0.306455 0.231966 5028.0 2759.0 -0.243067 7787.0
22 technology 0.244977 0.232376 1646.0 445.0 -0.051437 2091.0
23 true-crime 0.460944 0.165546 2325.0 911.0 -0.640854 3236.0
24 tv 0.269466 0.333177 4485.0 2837.0 0.236431 7322.0
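
The relative_change_in_ratio column above appears to be (post_cutoff_ratio - pre_cutoff_ratio) / pre_cutoff_ratio, where each ratio is the top 1% review count divided by the category total. The sketch below checks this reading against the arts row, using the counts printed in the earlier table; the function names are hypothetical and not taken from the notebook.

```python
def top_1_ratio(top_1_reviews, total_reviews):
    """Share of a category's reviews that belong to its top 1% of podcasts."""
    return top_1_reviews / total_reviews


def relative_change(pre_ratio, post_ratio):
    """Relative change of the ratio from the pre-cutoff to the post-cutoff period."""
    return (post_ratio - pre_ratio) / pre_ratio


# arts: 3574 of 16557 reviews before the cutoff, 2285 of 9112 after it
arts_pre = top_1_ratio(3574, 16557)
arts_post = top_1_ratio(2285, 9112)
arts_change = relative_change(arts_pre, arts_post)
print(f"{arts_pre:.6f} {arts_post:.6f} {arts_change:.6f}")
```

Positive values mean the top 1% captured a larger share of the category's reviews after the cutoff, and negative values mean a smaller share.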
top_level_category year_month_first prop_top_1_percent_first year_month_last prop_top_1_percent_last prop_change
15 music 2005-11-30 0.0 2022-12-31 0.891667 0.891667
19 society 2005-11-30 0.0 2022-12-31 0.805195 0.805195
8 health 2005-11-30 0.0 2022-12-31 0.365854 0.365854
0 arts 2005-11-30 0.0 2022-12-31 0.000000 0.000000
13 kids 2005-11-30 0.0 2022-12-31 0.000000 0.000000
23 true-crime 2015-06-30 0.0 2022-12-31 0.000000 0.000000
22 technology 2005-11-30 0.0 2022-10-31 0.000000 0.000000
21 sports 2005-11-30 0.0 2022-12-31 0.000000 0.000000
20 spirituality 2005-11-30 0.0 2022-11-30 0.000000 0.000000
18 science 2005-11-30 0.0 2022-12-31 0.000000 0.000000
17 religion 2005-11-30 0.0 2022-12-31 0.000000 0.000000
16 news 2005-11-30 0.0 2022-11-30 0.000000 0.000000
14 leisure 2005-11-30 0.0 2022-12-31 0.000000 0.000000
12 judaism 2005-12-31 0.0 2022-08-31 0.000000 0.000000
1 buddhism 2005-11-30 0.0 2022-05-31 0.000000 0.000000
11 islam 2005-12-31 0.0 2022-08-31 0.000000 0.000000
10 history 2005-11-30 0.0 2022-11-30 0.000000 0.000000
9 hinduism 2006-11-30 0.0 2021-09-30 0.000000 0.000000
7 government 2005-11-30 0.0 2022-10-31 0.000000 0.000000
6 fiction 2005-12-31 0.0 2022-11-30 0.000000 0.000000
5 education 2005-11-30 0.0 2022-12-31 0.000000 0.000000
4 comedy 2005-11-30 0.0 2022-12-31 0.000000 0.000000
3 christianity 2005-11-30 0.0 2022-12-31 0.000000 0.000000
2 business 2005-11-30 0.0 2022-12-31 0.000000 0.000000
24 tv 2005-11-30 0.0 2022-12-31 0.000000 0.000000
'runs'
column_name data_type
0 run_at text
1 max_rowid integer
2 reviews_added integer
'podcasts'
column_name data_type
0 podcast_id text
1 itunes_id integer
2 slug text
3 itunes_url text
4 title text
'categories'
column_name data_type
0 podcast_id text
1 category text
'reviews'
column_name data_type
0 author_id text
1 podcast_id text
2 created_at text
3 title text
4 content text
5 rating integer
6 created_at_dt timestamp with time zone